Feature #8: Similarity Measure Between DNA Samples
Implement the "Similarity Measure Between DNA Samples" feature for our "Computational Biology" project.
We'll cover the following
Description#
The DNA of an alien species consists of a sequence of nucleotides, where each nucleotide is represented by a letter. We received two such DNA samples, and we need to measure the extent of similarity—also known as the edit distance—between them. The prevalent measure of similarity between these two samples is the minimum number of edits that are required to convert one DNA sample to the other.
Note: We are only permitted to insert, delete, or update a nucleotide in a DNA sample.
Given two DNA samples as strings, sample1 and sample2, we have to return the minimum number of operations that are required to convert sample1 to sample2.
The following examples may clarify this problem:
1 of 2
2 of 2
Solution#
We’ll compare the sample1 string with the sample2 string, one character at a time. If the characters at the current position match, then no edit operation will be required.
On the other hand, if the characters at the current position in the two strings don’t match, we can perform one of the following three operations, whichever will result in the fewest edit operations:
- Insert a character to
sample1at the current position. - Delete the character in
sample1at the current position. - Replace the character in
sample1at the current position with the character at the current position insample2.
The choice shown above can’t be made on the basis of local optimality. The actual cost of all of the choices given above must consider the edit cost for matching the rest of the characters in sample1 with the rest of the characters in sample2.
Let’s look at an example where we compare sample1 = abcdef with sample2 = azced. If we start the comparison on the first characters of the two strings, a, there will be a match. The second characters of the two strings are b and z, respectively. Over here, we can perform one of these three operations:
- If we insert
zintosample1, then we must find the edit distance between the stringsbcdefandced. - If we replace
bwithz, then we must find the edit distance between the stringscdefandced. - If we delete the character
b, then we must find the edit distance between the stringscdefandzced.
Once we’ve considered all of these options, we must return the minimum number of edit operations that are required to convert sample1 to sample2. This will lead to a recursive problem formulation, which has several overlapping subproblems and optimal substructure. Therefore, we will solve this problem using dynamic programming.
We will use the concept of memoization, where we use an array to store the already solved subproblems. We have to match all the nucleotides of the two samples that are given. For this, we can use a 2D array to store the edit distances. Let’s call this array D, whose size is defined by n and m, the lengths of sample1 and sample2, respectively.
The value in D[i][j] will represent the edit distance between the substrings formed by the first i nucleotides of sample1 and the first j nucleotides of sample2. To compute the edit distance D[i][j], we can perform the following three edit operations:
- Delete:
D[i-1][j], - Insert:
D[i][j-1], - Replace:
D[i-1][j-1]
We also need to take care of a couple of cases to compute the edit distance, D[i][j]:
-
If the nucleotide at
sample1[i]matches the nucleotide atsample2[j]: -
If the nucleotide at
sample1[i]does not match the nucleotide atsample2[j]:
Based on these computations, we will compute an edit distance D[i][j] like this:
1 of 4
2 of 4
3 of 4
4 of 4
We also have a couple of base cases, which are as follows:
-
If
sample2is empty, we can remove all the nucleotides ofsample1to make it empty too, that is,D[i][0] = i. -
If
sample1is empty, we have to insert all the nucleotides ofsample2intosample1,D[0][j] = i.
Now that we know the base cases and the general formulas used to compute an edit distance D[i][j], we can look at the computation steps in the illustration below:
1 of 34
2 of 34
3 of 34
4 of 34
5 of 34
6 of 34
7 of 34
8 of 34
9 of 34
10 of 34
11 of 34
12 of 34
13 of 34
14 of 34
15 of 34
16 of 34
17 of 34
18 of 34
19 of 34
20 of 34
21 of 34
22 of 34
23 of 34
24 of 34
25 of 34
26 of 34
27 of 34
28 of 34
29 of 34
30 of 34
31 of 34
32 of 34
33 of 34
34 of 34
Let’s look at the code for this solution:
Complexity measures#
| Time complexity | Space complexity |
|---|---|
Let and be the number of nucleotides in sample1 and sample2, respectively.
Time complexity#
Since we are using an array that stores the results for all the subproblems, we can conclude that we will not have more than subproblems. Therefore, the time complexity will be .
Space complexity#
The algorithm will use space for the array.
Feature #7: Detecting a Protein
Feature #9: Kth Missing Gene